Abstract

We learn the structure of a Markov Network between two groups of random variables
from joint observations. Since modelling and learning the full MN structure may
be hard, learning the links between two groups directly may be a preferable option. We
introduce a novel concept called the partitioned ratio whose factorization is directly
associated with the Markovian properties of random variables across two groups. A simple
one-shot convex optimization procedure is proposed for learning the sparse factorizations
of the partitioned ratio, and it is theoretically guaranteed to recover the correct
inter-group structure under mild conditions. The performance of the proposed method
is experimentally compared with state-of-the-art MN structure learning methods
using ROC curves. Real applications on analyzing bipartisanship in the US Congress and
pairwise DNA/time-series alignments are also reported.


1 Introduction 


An undirected graphical model, or a Markov Network (MN) (Koller & Friedman, 2009;
Wainwright & Jordan, 2008), has a wide range of real-world applications, such as natural
language processing, computer vision, and computational biology. The structure of an MN,
which encodes the interactions among random variables, is one of the key interests of MN
learning tasks. However, on a high-dimensional dataset, learning the full MN structure can
be cumbersome since we may not have enough knowledge to model the entire MN, or our
application only concerns a specific portion of the MN structure.

Figure 1: An illustration of a full MN (left) and PMN (right). The full MN models all the
connections among random variables, while the PMN only models the interactions between groups
(red edges) and ignores connections within groups.

Rather than considering the full MN structure over the complete set of random variables,
we focus on learning a portion of the MN structure that links two groups of random variables,
namely the Partitioned Markov Network (PMN). A PMN is suitable for describing “inter-group
relations”. For example, politicians in the US Congress are naturally grouped into two
parties (Democrats and Republicans). Learning a PMN on congresspersons via their voting
records will reveal bipartisan collaborations among them. Similarly, a full gene network may
have a complicated structure. However, if genes can be clustered into a few homologous groups,
a PMN can help us understand how genes in different functional groups interact with each
other. An illustration of a full MN and a PMN is shown in Figure 1.

Since a PMN can be regarded as a “sub-structure” of a full MN, a naive approach may
be learning a full MN over the complete set of random variables and then reading off its PMN.
In fact, the machine learning community has seen huge progress on learning the sparse
structures of MNs, thanks to the pioneering works on sparsity-inducing norms (Tibshirani,
1996; Zhao & Yu, 2006; Wainwright, 2009).


A majority of the previous works fall into the category of the regularized maximum
likelihood approach, which maximizes the likelihood function of a probabilistic model under
sparsity constraints. Graphical lasso (Friedman et al., 2008; Banerjee et al., 2008) considers
a joint Gaussian model parameterized by the inverse covariance matrix, where zero elements
indicate conditional independence among random variables, while others have developed
useful variations of graphical lasso in order to loosen the Gaussianity assumed on data (Liu
et al., 2009; Loh & Wainwright, 2012). SKEPTIC (Liu et al., 2012) is a semi-parametric
approach that replaces the covariance matrix with a correlation matrix, such as Kendall’s
tau, in MN learning.

The latest advances along this line of research have been made by considering a node-wise
conditional probabilistic model. Instead of learning all the structures in one shot, such a
method focuses on learning the neighborhood structure of a single random variable at a
time. Maximizing the conditional likelihood leads to simple logistic regression (in the case
of the Ising model) (Ravikumar et al., 2010) or linear regression (in the case of the Gaussian
model) (Meinshausen & Bühlmann, 2006).

Unfortunately, the maximum (conditional) likelihood method can be difficult to compute
for general non-Gaussian graphical models, since computing the normalization term is in
general intractable. Though one may use sampling such as Monte Carlo methods (Robert &
Casella, 2005) to approximate the normalization term, there is no universal guideline telling
how to choose sampling parameters so that the approximation error is minimized.

A more severe problem is that sparsity approaches may have difficulties when learning
a dense MN. Specifically, the sample size required for a successful structure recovery grows
quadratically with the number of connected neighbors (Raskutti et al., 2009; Ravikumar
et al., 2010). However, it is quite reasonable to assume that in some applications, one node
may have many neighbors within its own group while connections to the other group are
sparse: a congressperson is very well connected to other members inside his/her party but
has only a few links with the opposition party. Genes in a homologous group may have a dense
structure but interact with another group of genes via only a few ties.

Is there a way to directly obtain the PMN structure? Neither maximizing a joint nor
a conditional likelihood takes the “partition information” into account, and interactions are
modelled globally. However, a PMN encodes only the local conditional independence between
groups, and the requirement for obtaining a good estimator should be much milder.

The above intuition leads us to a novel concept of the Partitioned Ratio (PR). Given a set
of partitioned random variables X = (X1, X2), PR is the ratio between the joint probability
P(X) and the product of its marginals P(X1)P(X2), i.e. $\frac{P(X)}{P(X1)P(X2)}$. In the same
way that the joint distribution can be decomposed into clique potentials of MN, we prove that
PR also factorizes over subgraph structures called passages, which indicate the connectivity
between two groups of random variables X1 and X2 in a PMN.

Conventionally, PR is a measure of the independence between two sets of random variables.
In this paper, we show that the factorization of this quantity indicates the linkage
between two groups of random variables, which is a natural extension of the regular usage
of PR.

Most importantly, we show the sparse factorization of this quantity may be learned via a
one-shot convex optimization procedure, which can be solved efficiently even for general
non-Gaussian distributions. The correct recovery of the sparse passage structure is theoretically
guaranteed under the assumption that the sample size increases with the number of passages,
which is not related to the structure density of the entire MN.

This paper is organized as follows. In Section 2, we review the Hammersley-Clifford
theorem (Section 2.1) and define some notations as preliminaries (Section 2.2). The
factorization theorems of PMN are introduced in Section 3 with a few simplifications. We give an
estimator to obtain the sparse factorization of PR in Section 4 and prove its recovered
structure is consistent in Section 5. Finally, experimental results on both artificial and real-world
datasets are reported in Section 6.


2 Background and Preliminaries 

In this section, we review the factorization theorems of MN. We limit our discussions to
strictly positive distributions from now on. A graph is always assumed to be finite, simple,
and undirected.


2.1 Background and Motivation 


Definition 1 (MN). For a joint probability P(X) of random variables $X = \{X_1, X_2, \ldots, X_m\}$,
if for all $i$, $P(X_i \mid \backslash X_i) = P(X_i \mid X_{N(i)})$, where $X_{N(i)}$ is the set of neighbors of node $X_i$ in graph G,
then P is an MN with respect to G.

Definition 2 (Gibbs Distribution). For a joint distribution P on a set of random variables
X, if the joint density can be factorized as

$$P(X) = \frac{1}{Z} \prod_{C \in C(G)} \phi_C(X_C),$$

where Z is the normalization term, C(G) is the set of complete subgraphs of G and each
factor $\phi_C$ is defined only on a subset of random variables $X_C$, then P is called a Gibbs
distribution that factorizes over G.

Theorem 1 (See e.g., Hammersley & Clifford (1971)). If P is an MN with respect to G
(Definition 1), then P is a Gibbs distribution that factorizes over G (Definition 2).

Theorem 2 (See e.g., Koller & Friedman (2009)). If P is a Gibbs distribution that factorizes
over G then P is an MN with respect to G.

Theorems 1 and 2 are the keystones of many MN structure learning methods. They state that,
by learning a sparse factorization of a joint distribution, we are able to spot the structure of
a graphical model. However, learning a joint distribution has never been an easy task due
to the normalization issue, and if the task is to learn a PMN that only concerns conditional
independence across two groups, such an approach seems to “solve a more general task as
an intermediate step” (Vapnik, 1998).


Does there exist an alternative to the joint distribution, whose factorization relates to
the structure of PMN? Ideally, such a factorization should be efficiently estimated from samples
with a tractable normalization term, and the estimation procedure should provide good
statistical guarantees.

In the rest of the paper, we show PR has the desired properties to indicate the structure
of a PMN: it factorizes over the structure of a PMN (Section 3) and is easy to estimate from
joint samples (Section 4) with good statistical properties (Section 5).


2.2 Definitions 

Notations. Sets are denoted by upper-case letters, e.g., A, B. An upper-case letter with a
lower-case subscript, $A_i$, means the $i$-th element in A. The set operator A\B means excluding set B
from set A. \B means the whole set excluding the set B. A = (A1, A2) is a partition of
set A, and an upper-case letter followed by an integer, e.g. A1, A2, means groups divided
by such a partition. Given a graph L = (N, E) and a subgraph $K \subseteq L$, $N_K$ or $E_K$ denotes
the subset of N or E whose elements are indexed topologically by K. Upper-case with bold
font, e.g. $\mathbf{K}$, is a set of sets.

Figure 2: If (I) is an MN over X, then (I), (II), (III) are all PMNs over X. If (I) is a PMN
over X, (I), (II), (III) are not necessarily the MN over X (but still PMNs over X).

Figure 3: (Left) ABCD and (Right) AB...Z are two passages.

PMN and Gibbs Partitioned Ratio. Now, we formally define a graph G = (X, E),
where X is a set of random variables and X = (X1, X2), i.e. $X1 \cap X2 = \emptyset$, $X1 \cup X2 = X$
and $X1, X2 \neq \emptyset$. The concept of PMN can now be defined.

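Definition 3 (PMN). For a joint probability P(X), X = (X1, X2), if

$$P(X_i \mid X1 \cup X_{N(i)} \backslash X_i) = P(X_i \mid \backslash X_i), \quad \forall X_i \in X1, \tag{1}$$

$$P(X_i \mid X2 \cup X_{N(i)} \backslash X_i) = P(X_i \mid \backslash X_i), \quad \forall X_i \in X2, \tag{2}$$

then P is a PMN with respect to G.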

The following proposition is a consequence of Definition 3, and an example is visualized
in Figure 2.

Proposition 1. If P is an MN with respect to G, then P is a PMN with respect to G, but 
not vice versa. 

Proposition 2. If P is a PMN with respect to G, $X_u \in X1$, $X_v \in X2$, and $v \notin N(u)$, then
$X_u \perp X_v \mid \backslash \{X_u, X_v\}$.

See Appendix A for the proof.

The concept of Passage is defined as follows: 

Definition 4 (Passage). Let X = (X1, X2). We define a passage B of G as a subgraph
of G, such that $X_B \cap X1 \neq \emptyset$, $X_B \cap X2 \neq \emptyset$, and $\forall X_u \in (X1 \cap X_B)$, $\forall X_v \in (X2 \cap X_B)$, we
have edge $(X_u, X_v) \in E_B$.
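
To make Definition 4 concrete, the following Python sketch checks whether a given subgraph is a passage and, separately, whether it is complete. The function names and the representation of a subgraph as a node collection plus an edge list are illustrative assumptions, not part of the original formulation.

```python
from itertools import product


def is_passage(nodes, edges, group1, group2):
    """Check Definition 4 for the subgraph (nodes, edges).

    The subgraph must touch both groups, and every node on the group-1
    side must be joined to every node on the group-2 side.
    """
    nodes = set(nodes)
    edge_set = {frozenset(e) for e in edges}
    side1 = nodes & set(group1)
    side2 = nodes & set(group2)
    if not side1 or not side2:
        return False
    return all(frozenset((u, v)) in edge_set for u, v in product(side1, side2))


def is_complete(nodes, edges):
    """Check whether every pair of distinct nodes is joined by an edge."""
    nodes = list(nodes)
    edge_set = {frozenset(e) for e in edges}
    return all(
        frozenset((nodes[i], nodes[j])) in edge_set
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
    )


# Hypothetical example: A, B in the first group and C, D in the second.
group1, group2 = {"A", "B"}, {"C", "D"}
cross_edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")]
passage_only = is_passage("ABCD", cross_edges, group1, group2)
also_clique = is_complete("ABCD", cross_edges)
```

In this example the four cross-group edges form a passage, but the subgraph is not complete because the within-group edges A-B and C-D are absent.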

Here we highlight two of the passage structures of two graphs in Figure 3.
From the definition, we can see that all cliques that go across two groups are passages, but not
all passages are cliques:

Proposition 3. Let X = (X1, X2). Given a passage B of G, B is a complete subgraph
if and only if $\forall X_u, X_v \in X_B \cap X1$, edge $(X_u, X_v) \in E_B$ and $\forall X_u, X_v \in X_B \cap X2$, edge
$(X_u, X_v) \in E_B$.

As an analogy to a Gibbs distribution used in the Hammersley-Clifford Theorem, we 
define the Gibbs partitioned ratio. 

Definition 5 (Gibbs Partitioned Ratio). For a joint distribution P over X = (X1, X2), if
the partitioned ratio has the form

$$\frac{P(X1, X2)}{P(X1)P(X2)} = \frac{1}{Z} \prod_{B \in \mathbf{B}(G)} \phi_B(X_B),$$

where $\mathbf{B}(G)$ is the set of all passages in G, then $\frac{P(X1, X2)}{P(X1)P(X2)}$ is called the Gibbs partitioned
ratio (GPR) over G.


3 Factorization over Passages 

In this section, we investigate the question: can we have a factorization theorem
similar to Theorems 1 and 2 for PMN? If so, learning the sparse factorization of PR may reveal
the Markovian properties among random variables.


3.1 Fundamental Properties 

There are two steps for introducing our factorization theorems. The first step is establishing
the Markovian property of random variables using the factorization of PR.

Theorem 3. Given X = (X1, X2), if the PR $\frac{P(X1, X2)}{P(X1)P(X2)}$ is a GPR over a graph G, then P is a
PMN with respect to G.

See Appendix for the proof. 

Next, let us prove the other direction: From the Markovian property to the factorization. 

Theorem 4. Given X = (X1, X2), if P is a PMN with respect to a graph G, then
$\frac{P(X1, X2)}{P(X1)P(X2)}$ is a GPR over G.

See Appendix for the proof. 

Simply put, the factorization of a GPR is only related to the “linkage” (or rigorously, passages)
between two groups. Interestingly, if we have an MN whose groups are linked via a few
“bottleneck” passages, then the factorization is simply over those sparse passages, no matter
how densely the graph is connected within each group. This gives PMN a significant
advantage over traditional MN in terms of modelling: if the interactions between groups are
simple (e.g. linear), we do not need to care about the interactions within groups, even if they are
highly complicated (e.g. non-linear). For example, in the bipartisan analysis problem, a PR
over congresspersons can be represented only via a few cross-party links, and a large chunk
of connections between congresspersons within their own party can be ignored, no matter
how complicated they are.

Theorems 3 and 4 point out a promising direction for structural learning of a PMN: once
the sparse factorization of a GPR is learned, we are able to recover the sparse passages of a
PMN partitioned into two groups.


3.2 Simplification of Passage Factorization 

The Hammersley-Clifford theorem (Theorem 1) shows P factorizes over cliques of G, given P
is an MN with respect to G. However, if one does not know the maximum size of cliques, the
model of a probability function has to consider factors on all potential cliques, i.e., all subsets
of X. It is unrealistic to construct a model with $2^{|X|}$ factors under the high-dimensional
setting.

Therefore, a popular assumption called “pairwise MN” (Koller & Friedman, 2009; Murphy,
2012) has been widely used to lower the computational burden of MN structure learning.
It assumes that in P, all clique factors can be further recovered using only bivariate and
univariate components, which gives rise to a pairwise model with only $(|X|^2 + |X|)/2$ factors.
Some well-known MNs, such as the Gaussian MN and the Ising model, are examples of pairwise
MNs.

Similar issues also arise when modelling GPR. There are $(2^{|X1|} - 1)(2^{|X2|} - 1)$ possible
passage potentials for the set of random variables X = (X1, X2). Following the same spirit,
we can consider a simplified model of PR by assuming that all passage potentials of the GPR
must factorize in a pairwise fashion, i.e.:


Definition 6 (Pairwise PR). For a joint distribution P over X = (X1, X2), if the partitioned
ratio has the form

$$\frac{P(X1, X2)}{P(X1)P(X2)} = \frac{1}{Z} \prod_{B \in \mathbf{B}(G)} \; \prod_{X_u, X_v \in X_B,\, u<v} h_{u,v}(X_u, X_v),$$

then $\frac{P(X1, X2)}{P(X1)P(X2)}$ is called a pairwise Gibbs partitioned ratio (pairwise PR) over G.

If we can assume the GPR we hope to learn is also a pairwise PR, the model may only
contain $(|X|^2 + |X|)/2$ pairwise factors, and is much easier to construct.
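
The gap between the two model sizes can be checked with a few lines of arithmetic. The sketch below simply evaluates the two counts stated above for given group sizes; the function names are illustrative, and the group sizes 45 and 55 are borrowed from the Senate experiment in Section 6.2.

```python
def num_passage_potentials(n1, n2):
    """(2^|X1| - 1)(2^|X2| - 1): one potential per pair of non-empty subsets."""
    return (2 ** n1 - 1) * (2 ** n2 - 1)


def num_pairwise_factors(n1, n2):
    """(|X|^2 + |X|) / 2 with |X| = |X1| + |X2|."""
    m = n1 + n2
    return (m * m + m) // 2


full_count = num_passage_potentials(45, 55)
pairwise_count = num_pairwise_factors(45, 55)  # 5050
```

With 45 and 55 variables the pairwise model has 5050 factors, whereas the number of possible passage potentials exceeds 10^29.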

In fact, pairwise PR does not have a straightforward relationship with pairwise MN, i.e.,
the PR of a pairwise MN may not be a pairwise PR, while the joint distribution
corresponding to a pairwise PR may not be a pairwise MN, since the pairwise MN and the
pairwise PR apply the same assumption to the parameterizations of two fundamentally
different quantities, the joint probability and the PR respectively.

Whether one should impose such an assumption on joint probability or PR is totally 
up to the application, as neither parameterization is always superior to the other. If the 
application focuses on learning the connections between two groups, we believe imposing 
such an assumption on PR directly is more sensible. 

However, as a special case, a joint Gaussian distribution is a pairwise MN, and its PR is 
also a pairwise PR. 


Proposition 4. If P over X = (X1, X2) is a zero-mean Gaussian distribution, then the PR
$\frac{P(X1, X2)}{P(X1)P(X2)}$ is a pairwise PR.

The Gaussian distribution factorizes over pairwise potentials, and the marginal
distributions P(X1) and P(X2) are still Gaussian distributions. Hence, from the construction of
the potential function in the proof of Theorem 4, we can verify this statement. Moreover,
one can show it has the pairwise factor $h_{u,v}(X_u, X_v) = \exp(\theta_{u,v} \cdot X_u X_v)$, where $\theta_{u,v}$ is the
parameter.

This pairwise assumption together with the factorization theorems motivates us to recover
the structure of PMN by learning a sparse pairwise PR model: for any $X_u \in X1$, $X_v \in X2$,
if $X_u$, $X_v$ appear in the same pairwise factor of a PR model, they must be involved
in at least one of the passage potentials.


4 Estimating PR from Samples 


To estimate PR using such a model, we require a set of samples $\{x^{(i)}\}_{i=1}^{n}$ drawn from $p(x)$,
and each sample vector is a joint sample, i.e. $x^{(i)} = (x_1^{(i)}, x_2^{(i)})$, where $x_1^{(i)}, x_2^{(i)}$ are
subvectors corresponding to two groups.

We define a log-linear pairwise PR model $g(x; \theta)$:

$$g(x; \theta) := \frac{1}{N(\theta)} \exp\Big[\sum_{u<v} \theta_{u,v}^\top \psi(x_u, x_v)\Big],$$

where $\theta_{u,v}$ is a column vector,

$$\theta = (\theta_{1,2}^\top, \ldots, \theta_{2,3}^\top, \ldots)^\top,$$

and $\psi$ is a vector-valued feature function. Notice that we still have to model
all pairwise features in x, but the vast majority of these pairs are going to be nullified due
to Theorem 4 if links between two groups are sparse.

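$N(\theta)$ is defined as a normalization function of $g(x; \theta)$:

$$N(\theta) := \int p(x_1)p(x_2) \exp\Big[\sum_{u<v} \theta_{u,v}^\top \psi(x_u, x_v)\Big] \mathrm{d}x, \tag{3}$$

where $p(x_1)$ and $p(x_2)$ are the marginal distributions of $p(x)$, so it is guaranteed that
$\int p(x_1)p(x_2)g(x; \theta)\,\mathrm{d}x = 1$.

$N(\theta)$ in (3) can be approximated via two-sample U-statistics (Hoeffding, 1963) using the
dataset,

$$N(\theta) \approx \hat{N}(\theta) := \frac{1}{\binom{n}{2}} \sum_{j \neq k} \exp\Big[\sum_{u<v} \theta_{u,v}^\top \psi(x_u^{(j,k)}, x_v^{(j,k)})\Big],$$

where $x^{(j,k)}$ is a permuted sample: $x^{(j,k)} = (x_1^{(j)}, x_2^{(k)})$.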

Notice that the normalization term $N(\theta)$ in (3) is an integral with respect to a probability
distribution $p(x_1)p(x_2)$. Though we do not have samples directly from such a distribution,
U-statistics help us “simulate” such an expectation using joint samples. In Maximum
Likelihood Estimation, density models are in general hard to compute since their normalization
term is not with respect to a sample distribution. In comparison, $N(\theta)$ can always be easily
approximated for any choice of $\psi$. This gives us the flexibility to consider complicated PR
models beyond the conventional Gaussian or Ising models.
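
The permuted-sample approximation can be sketched in a few lines of NumPy. The sketch assumes the scalar feature $\psi(x_u, x_v) = x_u x_v$ and stores the parameters in a square array of which only the entries with u < v are used; the function names and this storage layout are illustrative assumptions. It averages over all ordered pairs j ≠ k, so it may differ from the displayed normalizer by a constant factor, which only shifts the logarithm by a constant.

```python
import numpy as np


def pair_scores(X, theta):
    """sum_{u<v} theta[u, v] * x_u * x_v for each row x of X."""
    T = np.triu(theta, k=1)  # keep only the entries with u < v
    return np.einsum("iu,uv,iv->i", X, T, X)


def log_normalizer(X, d1, theta):
    """Log of the permuted-sample average of exp(score).

    X has one joint sample per row; the first d1 columns form group 1.
    Row j of group 1 is paired with row k of group 2 for all j != k.
    """
    n = X.shape[0]
    j, k = np.nonzero(~np.eye(n, dtype=bool))  # all ordered pairs j != k
    X_perm = np.hstack([X[j, :d1], X[k, d1:]])
    s = pair_scores(X_perm, theta)
    # log-mean-exp, computed stably through logaddexp
    return np.logaddexp.reduce(s) - np.log(s.size)


rng = np.random.default_rng(0)
X_demo = rng.standard_normal((6, 4))  # hypothetical data, 2 + 2 variables
log_n_at_zero = log_normalizer(X_demo, 2, np.zeros((4, 4)))
```

At $\theta = 0$ every exponent is zero, so the estimated log-normalizer is zero up to rounding, which gives a quick sanity check.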

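This model can be learned via the algorithm of maximum likelihood mutual information
(MLMI) (Suzuki et al., 2009), by simply minimizing the Kullback-Leibler divergence between
$p(x)$ and $p_\theta(x) = p(x_1)p(x_2)g(x; \theta)$:

$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta} \mathrm{KL}[p \,\|\, p_\theta].$$

Substituting the model of $g(x; \theta)$ into the above objective and approximating $N(\theta)$ by $\hat{N}(\theta)$,
the estimated parameter $\hat{\theta}$ is obtained as

$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta} \underbrace{-\frac{1}{n}\sum_{i=1}^{n} \sum_{u<v} \theta_{u,v}^\top \psi(x_u^{(i)}, x_v^{(i)}) + \log \hat{N}(\theta)}_{\ell_{\mathrm{MLMI}}(\theta)} + C',$$

where $C'$ is some constant. From now on, we denote $\ell_{\mathrm{MLMI}}(\theta)$ as the negative likelihood
function. Due to Theorem 4 and our parametrization, if the passages between two groups
are rare, then $\theta$ is very sparse. Therefore, we may use sparsity-inducing group-lasso penalties
(Yuan & Lin, 2006) to encourage sparsity on each subvector $\theta_{u,v}$:

$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta} \; \ell_{\mathrm{MLMI}}(\theta) + \lambda \sum_{u<v} \|\theta_{u,v}\|. \tag{4}$$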

This objective is convex, unconstrained, and can be easily solved by standard sub-gradient
methods. $\lambda$ is a regularization parameter that can be tuned via cross-validation.
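
As a rough illustration of how objective (4) might be minimized, the sketch below uses proximal gradient descent with soft-thresholding, which is a swapped-in alternative to the sub-gradient methods mentioned above. It assumes the scalar feature $\psi(x_u, x_v) = x_u x_v$, so that each group norm reduces to an absolute value. The function names, the fixed step size and the iteration count are all illustrative assumptions and may need tuning on real data.

```python
import numpy as np


def features(X):
    """Pairwise products x_u * x_v for u < v, one row per sample."""
    iu, iv = np.triu_indices(X.shape[1], k=1)
    return X[:, iu] * X[:, iv]


def permuted(X, d1):
    """Pair group 1 of sample j with group 2 of sample k, for all j != k."""
    n = X.shape[0]
    j, k = np.nonzero(~np.eye(n, dtype=bool))
    return np.hstack([X[j, :d1], X[k, d1:]])


def loss_and_grad(theta, F, F_perm):
    """Smooth part of (4): -mean score + log of the averaged exp(score)."""
    s = F_perm @ theta
    log_n = np.logaddexp.reduce(s) - np.log(s.size)
    w = np.exp(s - s.max())
    w /= w.sum()  # softmax weights over the permuted samples
    loss = -(F @ theta).mean() + log_n
    grad = -F.mean(axis=0) + w @ F_perm
    return loss, grad


def fit(X, d1, lam, step=0.1, iters=500):
    """Proximal gradient iterations; step and iters are assumed defaults."""
    F, F_perm = features(X), features(permuted(X, d1))
    theta = np.zeros(F.shape[1])
    for _ in range(iters):
        _, g = loss_and_grad(theta, F, F_perm)
        z = theta - step * g
        # soft-thresholding: proximal operator of step * lam * |.|
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return theta
```

A call such as `fit(X, d1, lam)` returns one coefficient per pair u < v, in the order produced by `np.triu_indices`; non-zero coefficients on cross-group pairs are the candidates for passage links.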

Now let us define the “true parameter” $\theta^*$, such that $p(x) = q(x)g(x; \theta^*)$ with $q(x) = p(x_1)p(x_2)$. The learned
parameter $\hat{\theta}$ is an estimate of $\theta^*$, where $\theta^*_{u,v}$ is non-zero on pairwise features that are
involved in at least one of the passage potentials. Moreover, as Theorem 3 and Proposition 2
show, if $X_u \in X1$ and $X_v \in X2$ are not in any of the passage structures, i.e., $\theta^*_{u,v} = 0$, then
$X_u \perp X_v \mid \backslash \{X_u, X_v\}$.

Given the optimization problem (4), it is natural to consider the structure recovery
consistency, i.e., under what conditions is the sparsity pattern of $\hat{\theta}$ the same as that of $\theta^*$?

5 High-dimensional Structure Recovery Consistency 

To better state the structure recovery consistency theorem, we use a new indexing system
with respect to the sparsity pattern of the parameter. Denoting the pairwise index set as
$H = \{(u,v) \mid u < v\}$, two sets of subvector indices can be defined as $S = \{t' \in H \mid \|\theta^*_{t'}\| \neq 0\}$,
$S^c = \{t'' \in H \mid \|\theta^*_{t''}\| = 0\}$. We rewrite the objective (4) as

$$\hat{\theta} = \mathop{\mathrm{argmin}}_{\theta} \; \ell(\theta) + \lambda_n \sum_{t' \in S} \|\theta_{t'}\| + \lambda_n \sum_{t'' \in S^c} \|\theta_{t''}\|. \tag{5}$$

Similarly we can define $\hat{S}$ and $\hat{S}^c$. From now on, we simplify $\ell_{\mathrm{MLMI}}(\theta^*)$ as $\ell(\theta^*)$.

Now we state our assumptions. 

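Assumption 1 (Dependency). The minimum eigenvalue of the submatrix of the log-likelihood
Hessian is lower-bounded:

$$\Lambda_{\min}\big(\nabla_{\theta_S} \nabla_{\theta_S} \ell(\theta^*)\big) \ge \lambda_{\min} > 0,$$

with probability 1, where $\Lambda_{\min}$ is the minimum-eigenvalue operator of a symmetric matrix.

Assumption 2 (Incoherence).

$$\max_{t'' \in S^c} \Big\| \big[\nabla_{\theta_{t''}} \nabla_{\theta_S} \ell(\theta^*)\big] \big[\nabla_{\theta_S} \nabla_{\theta_S} \ell(\theta^*)\big]^{-1} \Big\|_1 \le 1 - \alpha,$$

with probability 1, where $0 < \alpha < 1$, and $\|Y\|_1 = \sum_{i,j} |Y_{i,j}|$.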

The first two assumptions are common in the literature on support consistency. The first
assumption guarantees the identifiability of the problem. The second assumption ensures that
the pairwise factors in passages are not too easily affected by those that are not in any passage.
The third assumption states that the likelihood function is “well-behaved”.

Assumption 3 (Smoothness on Likelihood Objective). The log-likelihood ratio $\ell(\theta)$ is
smooth around its optimal value, i.e., it has bounded derivatives

$$\max_{\delta,\, \|\delta\| \le \|\theta^*\|} \big\|\nabla^2 \ell(\theta^* + \delta)\big\| \le \lambda_{\max} < \infty,$$

$$\max_{t \in S \cup S^c} \; \max_{\delta,\, \|\delta\| \le \|\theta^*\|} \big|\big|\big|\nabla_{\theta_t} \nabla^2 \ell(\theta^* + \delta)\big|\big|\big| \le \lambda_{3,\max} < \infty,$$

with probability 1. Here $\|\cdot\|$ and $|||\cdot|||$ are the spectral norms of a matrix and a tensor
respectively (See e.g., Tomioka & Suzuki (2014) for the definition of the spectral norm of a tensor).


Assumption 4 (Bounded PR Model). For any vector $\delta \in \mathbb{R}^{\dim(\theta^*)}$ such that $\|\delta\| \le \|\theta^*\|$,
the following inequality holds:

$$0 < C_{\min} \le g(x; \theta^* + \delta) \le C_{\max} < \infty;$$


1 / 


^ Cjf ,me 

tiloo ^ ^ 


and II/J < E (SUS^). 


This assumption simply indicates our PR model is bounded from above and below around
the optimal value. Though it rules out the Gaussian distribution whose PR is not necessarily
upper/lower-bounded, as a theory of generic pairwise models, we think it is acceptable.


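Theorem 5. Suppose that Assumptions 1, 2, 3 and 4 are satisfied as well as
$\min_{t \in S} \|\theta^*_t\| \ge \frac{10}{\lambda_{\min}} \sqrt{|S|}\, \lambda_n$.
Suppose also that the regularization parameter is chosen so that

$$\frac{24(2-\alpha)}{\alpha} \sqrt{\frac{M \log \frac{m^2+m}{2}}{n}} \le \lambda_n,$$

where M is a positive constant. Then there exist some constants L, $K_1$ and $K_2$ such that if
$n \ge L |S|^2 \log \frac{m^2+m}{2}$, with probability at least $1 - K_1 \exp(-K_2 \lambda_n^2 n)$, MLMI in (5) has the
following properties:

• Unique Solution: The solution of (5) is unique.

• Successful Passage Recovery: $\hat{S} = S$ and $\hat{S}^c = S^c$.

• $\|\hat{\theta} - \theta^*\| = O\Big(\sqrt{\frac{\log \frac{m^2+m}{2}}{n}}\Big)$.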


The proof of Theorem 5 is detailed in the Appendix. Since the PR function is a density
ratio function between $p(x)$ and $p(x_1)p(x_2)$, and (5) is also a sparsity-inducing
Kullback-Leibler Importance Estimation Procedure (KLIEP) (Sugiyama et al., 2008), the previously
developed support consistency theorem (Liu et al., 2015, 2016) can be applied here as long
as we can verify a few assumptions and lemmas.

The sample size required for the proposed method increases with $\log m$ (since $\log \frac{m^2+m}{2} <
2 \log m$ if $m > 2$), and the estimation error on $\theta$ vanishes at the speed of $\sqrt{\log m / n}$. These are
the same as the optimal rates obtained in previous research on Gaussian graphical model
structure learning (Ravikumar et al., 2010; Raskutti et al., 2009).

This theorem also indicates that the sample size required is not influenced by the structural
density of the entire MN structure, but by the number of pairwise factors in the
passage potentials. This is encouraging since we are allowed to explore PMNs with dense
groups, which would be hard to learn using conventional methods.

Figure 4: Synthetic experiments. Panels: (c) ROC of Gaussian Dataset, (d) ROC of “Diamond” Dataset.


6 Experiments 


Unless specified otherwise, we use the pairwise feature function $\psi(x_u, x_v) = x_u x_v$. Note this does
not mean we assume Gaussianity of the joint distribution, since this is a parameterization
of a PR rather than a joint distribution.


6.1 Synthetic Datasets 


We are interested in comparing the proposed method with a few possible alternatives: LL
(Meinshausen & Bühlmann, 2006; Ravikumar et al., 2010), SKEPTIC (Liu et al., 2012) and
Diff (Zhao et al., 2014), a direct difference estimation method that learns the differences
between two MNs without learning each individual precision matrix separately. In this paper,
we employed this method to learn the differences between two Gaussian densities: $p(x)$ and
$p(x_1)p(x_2)$.

We first generate a set of joint samples $\{x^{(i)}\}_{i=1}^{n} \sim \mathcal{N}(0, \Theta^{-1})$, where $\Theta \in \mathbb{R}^{50 \times 50}$ is
constructed in two steps. First, create

$$\Theta_{i,j} = \begin{cases} \rho^{|i-j|}, & i,j \le 40 \text{ or } i,j > 40, \\ 0, & \text{otherwise}, \end{cases}$$

where $0 < \rho < 1$ is a coefficient controlling the dominance of the diagonal entries. Second,
let $\Lambda$ be the 15th smallest eigenvalue of $\Theta$, and fill the submatrices $\Theta_{\{41,\ldots,50\},\{31,\ldots,40\}}$ and
$\Theta_{\{31,\ldots,40\},\{41,\ldots,50\}}$ with $\Lambda I_{10}$, where $I_{10}$ is a 10 x 10 identity matrix. By such a construction,
we have created two groups over X: $X = (X_{\{1,\ldots,40\}}, X_{\{41,\ldots,50\}})$, and 10 passages between
them. Notably, within the two groups, the precision matrix is dense, and random variables
interact with each other via powerful links when $\rho$ is large. An example of $\Theta$ when $\rho = 0.8$
is plotted in Figure 4(a). We measure the performance of three methods using the True
Positive Rate (TPR) and True Negative Rate (TNR). The detailed definition of TPR and
TNR is deferred to the Appendix.
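
The two-step construction of the precision matrix can be sketched as follows. The function name and its keyword arguments are illustrative assumptions; the defaults mirror the sizes quoted above. The sketch only builds the matrix: it does not check positive definiteness of the final matrix, which would be needed before sampling from a Gaussian with this precision.

```python
import numpy as np


def build_precision(rho, d=50, split=40, n_links=10, k=15):
    """Build the block precision matrix and the cross-group links.

    Step 1: entries rho^|i-j| inside each group, zero across groups.
    Step 2: place the k-th smallest eigenvalue of the step-1 matrix on
            the diagonals of the two cross-group 10 x 10 blocks.
    """
    idx = np.arange(d)
    Theta = float(rho) ** np.abs(idx[:, None] - idx[None, :])
    same_group = (idx[:, None] < split) == (idx[None, :] < split)
    Theta[~same_group] = 0.0

    lam = np.linalg.eigvalsh(Theta)[k - 1]  # eigenvalues come back ascending

    rows = np.arange(split, split + n_links)  # variables 41..50 (1-based)
    cols = np.arange(split - n_links, split)  # variables 31..40 (1-based)
    Theta[rows, cols] = lam
    Theta[cols, rows] = lam
    return Theta, lam


Theta_demo, lam_demo = build_precision(0.8)
```

After the second step, the cross-group block contains exactly ten non-zero entries, one per passage between the two groups.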

The ROC curve in Figure 4(c) can be plotted by adjusting the sensitivity of each method:
tuning the regularization parameter of the proposed method and LL, or the threshold
parameter of Diff.

Figure 5: Bipartisanship in 109th US Senate. Prefix “(D)” or “(R)” indicates the party
membership of a senator. Red: positive influence, Blue: negative influence. Edge widths are
proportional to $|\theta_{u,v}|$.

As we can see, the proposed method has the best overall performance for all choices of $\rho$,
compared to both LL and Diff. Also, as the links within each group get more and more
powerful (by increasing $\rho$), the performance of LL and Diff decays significantly, while that of the
proposed method remains almost unchanged.

As the proposed method is capable of handling complex models, we draw 50 samples from
a 52-dimensional “diamond” distribution used in (Liu et al., 2014) where the correlations
among random variables are non-linear. To speed up the sampling procedure, the graphical
model of this distribution is constructed by concatenating 13 simple 4-variable MNs whose
density functions are defined as

$$p(x_a, x_b, x_c, x_d) \propto \exp(-\rho x_a^2 x_b^2 - .5 x_b x_c - .5 x_b x_d) \cdot \mathcal{N},$$

where $\mathcal{N}$ is short for a normal density $\mathcal{N}(0, .5 I_4)$ over $x_a, x_b, x_c$ and $x_d$. Notice this
distribution does not have a closed-form normalization term. The graphical model of such a
distribution is illustrated in Figure 4(b). In this experiment, the coefficient $\rho$ is used to
control the strength of inter-group interactions ($x_a \leftrightarrow x_b$), and we set $\psi(x_u, x_v) = x_u^2 x_v^2$.
Other than LL, we include SKEPTIC due to the non-Gaussian nature of this dataset. The
performance is compared in Figure 4(d) using ROC curves.

The correlations among random variables are completely non-linear. As the power of
interactions on passages increases, LL performs worse and worse since it still relies on the
Gaussian model assumption. Thanks to the correct PR model, the proposed method
performs reasonably well and gets better when $\rho$ increases. As the density model does not fit
into the Gaussian copula model, SKEPTIC also performs poorly.


6.2 Bipartisanship in 109th US Senate


We use the proposed method to study the bipartisanship between Democrats and Republicans
in the US Senate via the recorded votes. In total, there were 100 senators (45
Democrats and 55 Republicans) casting votes on 645 questions with “yea”, “nay” or “not
voting”. The task is to discover the cross-party links between senators. We construct
a dataset $\{x^{(i)}\}_{i=1}^{645}$ using all 645 questions as observations, where each observation
$x \in \{1, -1, 0\}^{100}$ corresponds to the votes on a single question by 100 senators, and random
variables $X = (X_{\{1,\ldots,45\}}, X_{\{46,\ldots,100\}})$ are senators partitioned according to party
memberships.

We run the proposed method directly on this dataset, and decrease $\lambda$ from 10 until
$|\hat{S}| > 15$. To avoid complication, we only plot edges that contain nodes from different
groups in Figure 5.

It can be seen that Ben Nelson, a conservative Democrat described as “frequently voting against
his party” (Wikipedia, 2016a), has multiple links with the other side. On the right, Democrat
Tom Carper tends to agree with Republican Lincoln Chafee. Carper collaborated with
Chafee on multiple bipartisan proposals (Press-Release, a; b), while Chafee, noted for his “support
for fiscal and social policies that often opposed those promoted by the Republican Party”
(Wikipedia, 2016b), finally switched his affiliation to Democratic in 2013. Interestingly, we
have also observed a cluster of senators who tend to disagree with each other.


6.3 Pairwise Sequences Alignment 

PMN can also be used to “align” sequences. Given a pair of sequences where points are
collected from the domain $\mathcal{X}$, we pick sequence 1 and construct the dataset by sliding a
window of size n toward the future, until reaching the end. Suppose there are $m_1$ windows generated,
then we can create a dataset $\{x_1^{(i)}\}_{i=1}^{n}$, $x_1 \in \mathcal{X}^{m_1}$. Similarly, we construct another dataset
$\{x_2^{(i)}\}_{i=1}^{n}$, $x_2 \in \mathcal{X}^{m_2}$ on sequence 2, and make joint samples by letting $x^{(i)} = (x_1^{(i)}, x_2^{(i)})$.
After learning a PMN over two groups, if $X_u \in X1$ and $X_v \in X2$ are connected, then we regard the
elements in the u-th window and the elements in the v-th window as “aligned”. See the figure
in the Appendix for an illustration.
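
One possible reading of this construction is sketched below: each window becomes one random variable (one column), and the n positions inside a window provide the n samples (rows). The stride of one element between consecutive windows and the function names are assumptions made for illustration.

```python
import numpy as np


def window_dataset(seq, n):
    """Return an (n, m) array whose u-th column is the u-th window of seq.

    Windows have length n and are assumed to advance by one element, so a
    sequence of length T yields m = T - n + 1 windows.
    """
    seq = np.asarray(seq)
    m = len(seq) - n + 1
    return np.stack([seq[u:u + n] for u in range(m)], axis=1)


def joint_samples(seq1, seq2, n):
    """Concatenate the two window datasets column-wise into joint samples."""
    return np.hstack([window_dataset(seq1, n), window_dataset(seq2, n)])


# Hypothetical toy sequences of lengths 5 and 6 with window size 3.
J_demo = joint_samples(range(5), range(10, 16), 3)  # shape (3, 3 + 4)
```

Under this layout, a learned link between column u of the first block and column v of the second block is read as an alignment between the u-th window of sequence 1 and the v-th window of sequence 2.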

We run the proposed method to learn PMNs over two datasets: Twitter keyword count
sequences (Liu et al., 2013) and amino acid sequences with GenBank IDs AAD01939 and
AAQ67266. The results were obtained by decreasing $\lambda$ from 10 until $|\hat{S}| > 15$.

For the Twitter dataset, we collect normalized frequencies of keywords as time-series
over 8 months, during the “Deepwater Horizon oil spill” event in 2010. We learn alignments
between two pairs of keywords: “Obama” vs. “Spill” and “Spill” vs. “BP”. The results
are plotted in Figure 6(a), where we can see the sequences of two pairs are aligned well in
chronological order. The two popular keywords, “BP” and “Spill”, are synchronized throughout
almost the entire event, while “Spill” and “Obama” are only synchronized later on, after
he delivered his speech in the Oval Office on this crisis on June 15th, 2010.

The next experiment uses two amino acid string sequences, consisting of codes such as ‘V’,
‘F’ and ‘L’. Figure 6(b) shows that the proposed method has successfully identified
the aligned segment between the eyeless gene of Drosophila melanogaster (a fruit fly) and the
human aniridia genes. The same segment is also spotted by the widely used Needleman-Wunsch
(NW) algorithm (Needleman & Wunsch, 1970) with statistical significance.

[Figure 6(a): time series of the keywords "Obama", "Spill" and "BP" from Feb 1 to Oct 1,
annotated with "June 15, Obama addresses at Oval Office; June 16, Obama meets BP executives."]

(a) Twitter keyword frequency time-series alignments, $n = 50$, $m = 962$ and $\mathcal{X} = \mathbb{R}$.

[Figure 6(b): aligned amino acid sequences, with the fly sequence labelled "FLY" and a marker
for "Intervals found by Needleman-Wunsch (NW) algorithm".]

(b) Amino acid sequence alignments between AAD01939 (human) and AAQ67266 (fly),
$n = 10$, $m = 592$, $\psi(x_i, x_j) = \delta(x_i, x_j)$ and $\mathcal{X}$ = {amino acid dictionary}.

Figure 6: Sequence alignment. For two aligned windows with size $n$, we plot $n$ gray lines
between two windows linking each pair of elements. Since lines are so close to each other,
they look like “gray shades” on the plot. The color box contains the region of consecutively
aligned windows.
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A Proof of Proposition 

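Proof. For $X_u \in X_1$,

$$P(X_u \mid \backslash X_u) = \frac{P(X_u, X_{\backslash N(u)} \cap X_2 \mid X_1 \cup X_{N(u)} \backslash X_u)}{P(X_{\backslash N(u)} \cap X_2 \mid X_1 \cup X_{N(u)} \backslash X_u)}.$$

Since $P(X_u \mid \backslash X_u) = P(X_u \mid X_1 \cup X_{N(u)} \backslash X_u)$ by the Markovian property of PMN,
substituting it into the above equation, we have $X_u \perp X_{\backslash N(u)} \cap X_2 \mid X_1 \cup X_{N(u)} \backslash X_u$.

$X_v \notin X_{N(u)}$ means $X_v \in X_{\backslash N(u)} \cap X_2$. Using the weak union rule for conditional
independence (see e.g., (Koller & Friedman, 2009), 2.1.4.3), we obtain $X_u \perp X_v \mid \backslash\{X_u, X_v\}$.

For $X_u \in X_2$, the proof is the same. □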
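As a numerical illustration of the pairwise statement, consider the Gaussian special case, where a zero entry of the precision matrix corresponds to conditional independence given all remaining variables. The sketch below is only an example under that assumption; the matrix `theta`, the grouping of variables and the helper `conditional_cov` are hypothetical and not taken from the paper.

```python
import numpy as np

def conditional_cov(sigma, u, v):
    """Covariance of (X_u, X_v) given all remaining variables (Schur complement)."""
    d = sigma.shape[0]
    a = [u, v]
    r = [k for k in range(d) if k not in a]
    s_aa = sigma[np.ix_(a, a)]
    s_ar = sigma[np.ix_(a, r)]
    s_rr = sigma[np.ix_(r, r)]
    return s_aa - s_ar @ np.linalg.solve(s_rr, s_ar.T)

# Variables 0, 1 form group 1 and variables 2, 3 form group 2.
# The only cross-group edge is (0, 2); theta is a diagonally dominant precision matrix.
theta = np.array([[2.0, 0.5, 0.6, 0.0],
                  [0.5, 2.0, 0.0, 0.0],
                  [0.6, 0.0, 2.0, 0.4],
                  [0.0, 0.0, 0.4, 2.0]])
sigma = np.linalg.inv(theta)

cov_13 = conditional_cov(sigma, 1, 3)  # non-adjacent cross-group pair
cov_02 = conditional_cov(sigma, 0, 2)  # adjacent cross-group pair
```

In this example `X_1` and `X_3` are marginally correlated (through the path 1-0-2-3), yet the off-diagonal entry of `cov_13` vanishes up to rounding, while that of `cov_02` does not.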


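B Proof of Theorem 3

Proof. We define $B(i)$ as the set of passages that contain $X_i$. Here we only show that the
PMN Markov equality $P(X_i \mid X_1 \cup X_{N(i)} \backslash X_i) = P(X_i \mid \backslash X_i)$ holds for GPR. Let us write
$\phi_B$ as short for $\phi_B(X_B)$.

$$
\begin{aligned}
P(X_i \mid X_1 \cup X_{N(i)} \backslash X_i)
&= \frac{\frac{1}{Z}\int_{X_{\backslash N(i)} \cap X_2} P(X_1)P(X_2)\prod_{B \in B(G)} \phi_B}
        {\frac{1}{Z}\int_{X_i}\int_{X_{\backslash N(i)} \cap X_2} P(X_1)P(X_2)\prod_{B \in B(G)} \phi_B} \\
&= \frac{P(X_1)\prod_{B \in B(i)} \phi_B \int_{X_{\backslash N(i)} \cap X_2} P(X_2)\prod_{B \in \backslash B(i)} \phi_B}
        {\left(\int_{X_i} P(X_1)\prod_{B \in B(i)} \phi_B\right)\left(\int_{X_{\backslash N(i)} \cap X_2} P(X_2)\prod_{B \in \backslash B(i)} \phi_B\right)} \\
&= \frac{P(X_1)\prod_{B \in B(i)} \phi_B}{\int_{X_i} P(X_1)\prod_{B \in B(i)} \phi_B} \\
&= \frac{P(X_1)\prod_{B \in B(i)} \phi_B \cdot \frac{1}{Z}P(X_2)\prod_{B \in \backslash B(i)} \phi_B}
        {\int_{X_i} P(X_1)\prod_{B \in B(i)} \phi_B \cdot \frac{1}{Z}P(X_2)\prod_{B \in \backslash B(i)} \phi_B} \\
&= P(X_i \mid \backslash X_i),
\end{aligned}
$$

from which we obtain the desired equality. Note that we used the fact that
$X_{B(i)} \cap (X_{\backslash N(i)} \cap X_2) = \emptyset$ when factoring the integrals over $X_{\backslash N(i)} \cap X_2$.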

C Proof of Theorem 4

Proof. This proof is constructive. Let us clarify some notation used in this proof. A lower-case
bold letter $a$ is a vector-realization of a set of random variables $A$. $P(a_K, c)$ means
the probability of a realization where elements appearing on positions indexed by subgraph
$K$ are allowed to take random values, while other elements are fixed to the value $c \in \mathrm{dom}(X)$.
Note that $K$ might be $\emptyset$. We write $P_1(X)$ for the marginal $P(X_1)$, and likewise $P_2(X)$ for $P(X_2)$.

First we define the following potential function:

$$\phi_S(X_S = x_S) = \prod_{Z \subseteq S} \lambda_Z(x_Z)^{(-1)^{|S| - |Z|}},$$

where $S$ is a subset of $G$, and

$$
\lambda_Z(x_Z) =
\begin{cases}
\dfrac{P(x_Z, c)}{P_1(x_Z, c)\,P_2(x_Z, c)}, & \exists B \in B(G),\ B \subseteq Z, \\
1, & \text{otherwise}.
\end{cases}
\qquad (6)
$$


First we show by construction that the multiplication of all potential functions over all
subgraph structures, i.e., $\prod_{S \subseteq G} \phi_S$, actually gives us the PR.

Due to the inclusion-exclusion principle (see, e.g., Koller & Friedman (2009), 4.4.2.1), it
can be shown that

$$\prod_{S \subseteq G} \phi_S(X_S = x_S) = \lambda_G(x).$$
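The inclusion-exclusion identity can be checked numerically on a small node set. The sketch below treats the values $\lambda_Z$ as arbitrary positive numbers (an assumption made purely for illustration, since the identity does not depend on how they were obtained); the helper names `subsets` and `potentials` are hypothetical.

```python
from itertools import combinations
import math
import random

def subsets(nodes):
    """Yield every subset of nodes as a frozenset, including the empty set."""
    nodes = tuple(nodes)
    for r in range(len(nodes) + 1):
        for c in combinations(nodes, r):
            yield frozenset(c)

def potentials(lam, nodes):
    """phi_S = prod over Z subset of S of lam[Z] ** ((-1) ** (|S| - |Z|))."""
    phi = {}
    for s in subsets(nodes):
        log_phi = sum((-1) ** (len(s) - len(z)) * math.log(lam[z]) for z in subsets(s))
        phi[s] = math.exp(log_phi)
    return phi

nodes = (0, 1, 2, 3)
rng = random.Random(0)
lam = {z: rng.uniform(0.5, 2.0) for z in subsets(nodes)}  # arbitrary positive values
phi = potentials(lam, nodes)
product_of_potentials = math.prod(phi.values())
```

Up to floating-point error, `product_of_potentials` coincides with `lam[frozenset(nodes)]`, because every $\lambda_Z$ with $Z \neq G$ receives a total exponent of zero.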


If the graph $G$ contains any passage, then by definition $\lambda_G(x) = \frac{P(x)}{P_1(x)P_2(x)}$, which is exactly
the PR. However, if $G$ does not include any passage, meaning $X_1$ is completely independent
of $X_2$, then $\lambda_G(x) = 1$ by definition, which is the exact value that a PR would take in such a
case.
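The two cases can be illustrated on small discrete distributions. In this sketch each group consists of a single discrete variable and the joint distribution is given as a table; both choices, and the function name `partitioned_ratio`, are assumptions made for illustration only.

```python
import numpy as np

def partitioned_ratio(joint):
    """joint[a, b] = P(X1 = a, X2 = b); returns P(x1, x2) / (P(x1) P(x2))."""
    joint = np.asarray(joint, dtype=float)
    p1 = joint.sum(axis=1, keepdims=True)  # marginal of the first group
    p2 = joint.sum(axis=0, keepdims=True)  # marginal of the second group
    return joint / (p1 * p2)

# Independent groups: the joint table is an outer product of the marginals.
independent = np.outer([0.3, 0.7], [0.2, 0.5, 0.3])
# Dependent groups: mass concentrates on the diagonal.
dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])

pr_independent = partitioned_ratio(independent)
pr_dependent = partitioned_ratio(dependent)
```

For the independent table the ratio equals one everywhere, whereas the dependent table yields values different from one (for instance 0.4 / (0.5 * 0.5) = 1.6 on the diagonal).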

Second, we show that this construction under the PMN condition is actually a GPR. Specifically,
we show that if $S$ is not a passage, then $\phi_S(X_S = x_S) = 1$, i.e. its potential function is nullified.

Obviously, for a “one-sided” $S$, i.e. $X_S \cap X_1 = \emptyset$ or $X_S \cap X_2 = \emptyset$, by definition, $\phi_S = 1$.

Otherwise, if $S$ is “two-sided” but itself is not a passage, we should be able to find two
nodes, indexed by $X_u \in X_1 \cap X_S$ and $X_v \in X_2 \cap X_S$, that are not connected by an edge.
We may write the potential function for a subgraph $S$ as


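$$\phi_S(X_S = x_S) = \prod_{W \subseteq S \backslash \{u,v\}} \left( \frac{\lambda_W(x_W)\,\lambda_{W \cup \{u,v\}}(x_{W \cup \{u,v\}})}{\lambda_{W \cup \{u\}}(x_{W \cup \{u\}})\,\lambda_{W \cup \{v\}}(x_{W \cup \{v\}})} \right)^{*},$$

where $*$ means we do not care about the exact power, which can be either $-1$ or $1$, and

$$
\frac{\lambda_W(x_W)\,\lambda_{W \cup \{u,v\}}(x_{W \cup \{u,v\}})}{\lambda_{W \cup \{u\}}(x_{W \cup \{u\}})\,\lambda_{W \cup \{v\}}(x_{W \cup \{v\}})}
= \frac{P_W\,P_{W \cup \{u,v\}}}{P_{W \cup \{u\}}\,P_{W \cup \{v\}}}
\cdot \frac{P^2_{W \cup \{v\}}\,P^2_W\,P^1_{W \cup \{u\}}\,P^1_W}{P^1_W\,P^2_W\,P^1_{W \cup \{u\}}\,P^2_{W \cup \{v\}}},
\qquad (7)
$$

where we have simplified the notation $P(x_A, c)$ as $P_A$ (and likewise $P_1(x_A, c)$ and $P_2(x_A, c)$ as
$P^1_A$ and $P^2_A$). The second factor on the RHS of (7) is apparently 1. For the first factor on the RHS
of (7), we may divide both the numerator and denominator by $P_W \cdot P_W$. Then it yields
$\frac{P_{W \cup \{u,v\}}/P_W}{(P_{W \cup \{u\}}/P_W)(P_{W \cup \{v\}}/P_W)}$, which equals one if and only
if $X_u \perp X_v \mid \backslash\{X_u, X_v\}$. This is guaranteed by the PMN condition and the Proposition. □

D Proof of Theorem 5

Since the PR is a density ratio between the joint density $p(x_1, x_2)$ and the product of
two marginals $p(x_1)p(x_2)$, and the objective is derived from the same sparsity-inducing
KLIEP criterion discussed in Liu et al. (2015, 2016), the proof of Theorem 5 follows
the primal-dual witness procedure (Wainwright, 2009).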

First, the assumptions we have made are essentially the same as those imposed in
Section 3.2 of Liu et al. (2016) (the Hessian of the negative log-likelihood is the sample
Fisher information matrix). Then the proof follows the steps established in Section 4 of
Liu et al. (2016). The only thing we need to verify is that
$\max_t \|\nabla_{\theta_t}\ell(\theta^*)\|$ is upper-bounded with high probability as $n \to \infty$. We formally state this
in the following lemma:

Lemma 1. If

$$\lambda_n \ge \frac{24(2-\alpha)}{\alpha}\sqrt{\frac{c\,\log\frac{m^2+m}{2}}{n}},$$

then

$$P\left(\max_{t \in S \cup S^c} \|\nabla_{\theta_t}\ell(\theta^*)\| \ge \frac{\alpha\lambda_n}{4(2-\alpha)}\right) \le 3\exp(-c''n),$$

where $c$ and $c''$ are some constants.

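Proof. For convenience, let us denote the approximated PR model
$\exp\big(\sum_{u,v} \theta_{u,v}^\top \psi(x_{u,v})\big)/\hat{N}(\theta)$ as $\hat{g}(x;\theta)$. Since
$\hat{g}(x;\theta) = \frac{N(\theta)}{\hat{N}(\theta)}\,g(x;\theta)$, and
$\frac{\hat{N}(\theta)}{N(\theta)} = \frac{1}{\binom{n}{2}}\sum_{j \neq k} g(x^{[j,k]};\theta)$ is always bounded
by $[C_{\min}, C_{\max}]$, we can see that $\hat{g}(x;\theta)$ is also bounded. For simplicity, we write

$$0 < C'_{\min} \le \hat{g}(x;\theta) \le C'_{\max} < \infty.$$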

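We have

$$\nabla_{\theta_t}\ell(\theta^*) = -\frac{1}{n}\sum_{i=1}^{n} f_t(x^{(i)}) + \frac{1}{\binom{n}{2}}\sum_{j \neq k} \hat{g}(x^{[j,k]};\theta^*)\,f_t(x^{[j,k]}),$$

where $f_t$ denotes the feature vector paired with $\theta_t$. First we show that $\|\nabla_{\theta_t}\ell(\theta^*)\|$ can be
upper-bounded as:

$$
\begin{aligned}
\|\nabla_{\theta_t}\ell(\theta^*)\| \le{}& \underbrace{\left\| \frac{1}{n}\sum_{i=1}^{n} f_t(x^{(i)}) - E_p[f_t(x)] \right\|}_{a_n} \\
&+ \underbrace{\left\| \frac{1}{\binom{n}{2}}\sum_{j \neq k} \hat{g}(x^{[j,k]};\theta^*)\,f_t(x^{[j,k]}) - \frac{1}{\binom{n}{2}}\sum_{j \neq k} g(x^{[j,k]};\theta^*)\,f_t(x^{[j,k]}) \right\|}_{b_n} \\
&+ \underbrace{\left\| \frac{1}{\binom{n}{2}}\sum_{j \neq k} g(x^{[j,k]};\theta^*)\,f_t(x^{[j,k]}) - E_{p(x_1)p(x_2)}[g(x;\theta^*)\,f_t(x)] \right\|}_{c_n = \|w_n\|},
\end{aligned}
$$

which uses $E_p[f_t(x)] = E_{p(x_1)p(x_2)}[g(x;\theta^*)\,f_t(x)]$.

We now need the Hoeffding inequality (Hoeffding, 1963) for bounded-norm vector random
variables, which has appeared in previous literature such as Steinwart & Christmann (2008):
for a set of bounded zero-mean vector-valued random variables with $\|y\| \le c$, we have

$$P\left(\left\|\sum_{i=1}^{n} y_i\right\| \ge n\epsilon\right) \le \exp\left(-\frac{n\epsilon^2}{2c^2}\right)$$

for all $\epsilon \ge \frac{c}{\sqrt{n}}$. Now it is easy to see that

$$P(a_n \ge \epsilon) \le \exp\left(-\frac{2n\epsilon^2}{C'^2}\right) \qquad (8)$$

as long as

$$\epsilon \ge \frac{C'}{2\sqrt{n}}. \qquad (9)$$

As to $b_n$, it can be upper-bounded by

$$
\begin{aligned}
b_n &= \left\| \frac{1}{\binom{n}{2}}\sum_{j \neq k} \hat{g}(x^{[j,k]};\theta^*)\,f_t(x^{[j,k]}) - \frac{1}{\binom{n}{2}}\sum_{j \neq k} g(x^{[j,k]};\theta^*)\,f_t(x^{[j,k]}) \right\| \\
&\le \frac{1}{\binom{n}{2}}\sum_{j \neq k} \hat{g}(x^{[j,k]};\theta^*)\,\|f_t(x^{[j,k]})\| \left| 1 - \frac{\hat{N}(\theta^*)}{N(\theta^*)} \right| \\
&\le C'_{\max}\,C_{f_t,\max} \left| \frac{1}{\binom{n}{2}}\sum_{j \neq k} g(x^{[j,k]};\theta^*) - 1 \right|,
\end{aligned}
$$

and due to the Hoeffding inequality for U-statistics (see (Hoeffding, 1963), 5b) we may obtain:

$$P(b_n > \epsilon) \le 2\exp\left(-\frac{2n\epsilon^2}{C'^2_{\max}\,C^2_{\max}\,C^2_{f_t,\max}}\right). \qquad (10)$$

As to $w_n$, we first bound its $i$-th element $w_{i,n}$ using the Hoeffding inequality for U-statistics;
thus by using the union bound, we have

$$P(\|w_n\|_\infty > \epsilon) \le 2b\exp\left(-\frac{2n\epsilon^2}{C^2_{\max}\,C^2_{f_t,\max}}\right),$$

and since $\|w_n\| \le \sqrt{b}\,\|w_n\|_\infty$, we have

$$P(\|w_n\| > \epsilon) \le P(\sqrt{b}\,\|w_n\|_\infty > \epsilon) \le 2b\exp\left(-\frac{2n\epsilon^2}{b\,C^2_{\max}\,C^2_{f_t,\max}}\right). \qquad (11)$$

Therefore, combining (8), (10) and (11):

$$P(\|\nabla_{\theta_t}\ell(\theta^*)\| > 3\epsilon) \le P(a_n + b_n + c_n > 3\epsilon) \le c''\exp(-c'n\epsilon^2),$$

where $c'$ is a constant determined by the constants appearing in (8), (10) and (11), and
$c'' = 2b + 3$, given that (9) holds. Applying the union bound for all $t \in S \cup S^c$,

$$P\left(\max_{t \in S \cup S^c} \|\nabla_{\theta_t}\ell(\theta^*)\| > 3\epsilon\right) \le \frac{c''(m^2+m)}{2}\exp(-c'n\epsilon^2).$$

Substituting $\epsilon = \frac{\alpha\lambda_n}{12(2-\alpha)}$ gives

$$P\left(\max_{t \in S \cup S^c} \|\nabla_{\theta_t}\ell(\theta^*)\| \ge \frac{\alpha\lambda_n}{4(2-\alpha)}\right) \le \frac{c''(m^2+m)}{2}\exp\left(-c'n\left(\frac{\alpha\lambda_n}{12(2-\alpha)}\right)^2\right),$$

and when $\lambda_n \ge \frac{24(2-\alpha)}{\alpha}\sqrt{\frac{c\,\log\frac{m^2+m}{2}}{n}}$,

$$P\left(\max_{t \in S \cup S^c} \|\nabla_{\theta_t}\ell(\theta^*)\| \ge \frac{\alpha\lambda_n}{4(2-\alpha)}\right) \le c''\exp(-c'''n),$$

where $c'''$ is a constant. Assume that $\log\frac{m^2+m}{2} > 1$ and we set $\lambda_n$ as

$$\lambda_n \ge \frac{24(2-\alpha)}{\alpha}\sqrt{\frac{(c' + C'^2)\,\log\frac{m^2+m}{2}}{n}},$$

then (9), the condition for using the vector Hoeffding inequality, is satisfied. □

Given Lemma 1, we may obtain other technical results, such as the estimation error
bound, using the same proof as demonstrated in Section 4 of Liu et al. (2016).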


E Experimental Settings 

We measure the performance of three methods using the True Positive Rate (TPR) and the
True Negative Rate (TNR) that are used in Zhao et al. (2014). The TPR and TNR are defined
as:

$$
\mathrm{TPR} = \frac{\sum_{t' \in S} \delta(\hat{\theta}_{t'} \neq 0)}{\sum_{t' \in S} \delta(\theta^*_{t'} \neq 0)}, \qquad
\mathrm{TNR} = \frac{\sum_{t'' \in S^c} \delta(\hat{\theta}_{t''} = 0)}{\sum_{t'' \in S^c} \delta(\theta^*_{t''} = 0)},
$$

where $\delta$ is the indicator function.

Figure 7: The illustration of sequence matching problem formulation.
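The TPR and TNR defined in this appendix can be computed from an estimated and a true parameter vector as in the sketch below. It assumes one scalar parameter per candidate edge and at least one true edge and one true non-edge; the function name `tpr_tnr` and the optional tolerance `tol` are hypothetical additions.

```python
import numpy as np

def tpr_tnr(theta_hat, theta_true, tol=0.0):
    """Return (TPR, TNR) of the estimated support against the true support."""
    est = np.abs(np.asarray(theta_hat, dtype=float)) > tol   # estimated edges
    true = np.asarray(theta_true, dtype=float) != 0          # true edges (set S)
    tpr = np.sum(est & true) / np.sum(true)      # detected true edges / |S|
    tnr = np.sum(~est & ~true) / np.sum(~true)   # rejected non-edges / |S^c|
    return tpr, tnr

# True edges at positions 0 and 3; the estimate finds position 0 and a false edge at 2.
tpr, tnr = tpr_tnr([0.5, 0.0, 0.1, 0.0, 0.0], [1.0, 0.0, 0.0, 2.0, 0.0])
```

Sweeping the regularization parameter and recording the resulting pairs of TPR and TNR values is one way to trace out an ROC curve.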


The differential learning method (Zhao et al., 2014) used in Section 6.1 learns the difference
between two precision matrices. In our setting, if one can learn the difference between
the precision matrices of $p(x)$ and $p(x_1)p(x_2)$, one can figure out all edges that go across
two groups ($x_1$ and $x_2$).

This method requires sample covariance matrices of $p(x)$ and $p(x_1)p(x_2)$ respectively.
The sample covariance of $p(x)$ is easy to compute given joint samples. However, to obtain
the sample covariance of $p(x_1)p(x_2)$, we would again need the U-statistics (Hoeffding,
1963) introduced earlier. We may approximate the $(u,v)$-th element of the covariance
matrix of $p(x_1)p(x_2)$ using the formula

$$\hat{\Sigma}_{u,v} = \frac{1}{\binom{n}{2}}\sum_{j \neq k} x_u^{[j,k]}\,x_v^{[j,k]},$$

where $x^{[j,k]} = (x_1^{(j)}, x_2^{(k)})$ pairs the first group of sample $j$ with the second group of
sample $k$, assuming the joint distribution has zero mean.
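The U-statistic estimate of the covariance under $p(x_1)p(x_2)$ can be sketched as follows, assuming zero-mean data as in the text. The function name `product_marginal_cov` is hypothetical, and the sketch normalizes by the number of ordered pairs, $n(n-1)$, so that the estimate is a plain average over all pairs $j \neq k$; this normalization is an assumption.

```python
import numpy as np

def product_marginal_cov(x1, x2):
    """Estimate the covariance of p(x1)p(x2) from n joint samples (zero mean assumed)."""
    x1 = np.asarray(x1, dtype=float)  # shape (n, m1)
    x2 = np.asarray(x2, dtype=float)  # shape (n, m2)
    n = x1.shape[0]
    # Within-group blocks do not depend on the pairing, so the average over
    # pairs reduces to the usual second moment.
    c11 = x1.T @ x1 / n
    c22 = x2.T @ x2 / n
    # Cross block: average of x1[j] x2[k]^T over all ordered pairs j != k,
    # computed as (all pairs) minus (matched pairs j == k).
    s1, s2 = x1.sum(axis=0), x2.sum(axis=0)
    c12 = (np.outer(s1, s2) - x1.T @ x2) / (n * (n - 1))
    return np.block([[c11, c12], [c12.T, c22]])
```

Subtracting the matched pairs avoids an explicit double loop over samples, so the cost stays linear in $n$.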


F Illustration of Sequence Matching 

We plot the illustration of our sequence matching problem formulation from two sequences
in Figure 7.